[RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change #850
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
Would appreciate any hints on how to add a unit test for this, or whether this is the right approach!
Maybe we can refer to the existing tests in [1][2] and write tests to ensure …
[1] https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/raycluster_controller_fake_test.go
```diff
@@ -233,6 +233,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
 			r.Log.Error(err, "Update status error", "cluster name", request.Name)
 		}
 	}
+	if instance.Status.State == rayiov1alpha1.Pending {
+		return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil
+	}

 	return ctrl.Result{}, nil
```
We should always requeue. I didn't realize we didn't do that. That's very bad. Reconciliation for any controller should happen periodically even if there isn't a trigger that demands it.
The correct fix for this issue is to change the "successful return" from `return ctrl.Result{}, nil` to `return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil`.
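For illustration, a minimal sketch of the requeue pattern being proposed, assuming a standard controller-runtime reconciler — this is not the actual KubeRay code, and the constant's value here is just the one under discussion at this point in the thread:

```go
package ray

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DefaultRequeueDuration mirrors the constant referenced above;
// 2 seconds is the value under discussion at this point in the thread.
const DefaultRequeueDuration = 2 * time.Second

type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the RayCluster and reconcile its pods and services ...

	// Returning ctrl.Result{} makes the controller wait for the next watch
	// event; RequeueAfter additionally schedules an unconditional periodic pass.
	return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil
}
```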
Oh thanks, that's very good to know. Let me update the PR
This is a good point.
```diff
@@ -858,6 +861,8 @@ func (r *RayClusterReconciler) updateStatus(instance *rayiov1alpha1.RayCluster)
 	} else {
 		if utils.CheckAllPodsRunnning(runtimePods) {
 			instance.Status.State = rayiov1alpha1.Ready
+		} else {
+			instance.Status.State = rayiov1alpha1.Pending
```
It's normal for a Ray cluster to gain pods that are in Pending state, particularly if autoscaling is enabled. With this change, the cluster could switch back and forth between Running and Pending. That seems undesirable. (The current reporting of state is not great either.)
@DmitriGekhtman
How do we differentiate between Pods that are Pending due to the autoscaler vs. Pods that are Pending during initial cluster setup?
By the following funky logic: the only cluster `State` transition in the current logic is `""` -> `Ready`. If, after the initial cluster setup, we have a `Pending` pod due to a scale change or failure, the `State` will not return to `""`.
I'd leave the state reporting alone for now (maybe track the issue of making it more sensible).
We should change things so that we always requeue, i.e. do `return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil`.
I am fuzzy on how these filters work, but the way things should work is that … Currently, it might be that neither event triggers a reconcile.
This reverts commit 368dd36.
Signed-off-by: Archit Kulkarni <[email protected]>
@DmitriGekhtman Updated to always requeue the RayCluster reconciler. After this, it seems the RayJob reconciler also gets called every few seconds even after the job has succeeded. I'm not exactly sure what is causing the RayJob reconciler to get requeued, but it sounds like this is what we want anyway. There are ~10 other instances of …
Probably the RayCluster status update is causing that.
We could say testing for the fix falls under the umbrella of e2e testing for Ray Jobs, which is tracked in #763.
I guess by default it triggers once every 10 hrs. I'm not sure if once every 2 seconds is too frequent to requeue by default, but I think this should work fine for now. In a follow-up, we should fix the event filters to ensure that pod changes trigger immediate reconciliation.
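(For context, the ~10 hr figure matches controller-runtime's default informer resync period. A minimal sketch of where that interval can be tuned when constructing the manager — the value shown just makes the default explicit:)

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// controller-runtime resyncs its informer cache roughly every 10 hours
	// (with jitter) by default; SyncPeriod overrides that interval.
	syncPeriod := 10 * time.Hour
	if _, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	}); err != nil {
		panic(err)
	}
}
```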
One thing I'd like to confirm: in this case, is it meaningful to submit the job if the head is running and some workers are pending? I think moving forward with job submission makes sense. We also need to be careful to honor gang-scheduling settings as well.
The latest change will actually reconcile an object even after it reaches its end state, which is meaningless. I would say it brings side effects, especially when there are lots of CRs in the cluster. I think there are better ways to solve the problem (make the changes specific to jobs instead of clusters globally)?
I think we messed up the event filters in a previous PR -- we should fix the filters so that changes to RayCluster status are ignored but changes to the status of other watched resources (such as pods) are not ignored. About periodically re-triggering reconciliation -- …
So, again, given the above discussion -- my current opinion is that we should try to solve the issue by fixing the event filter so that pod status changes trigger reconcile. Separately, we can consider whether we want to always requeue after 30 seconds or a minute.
Hi @DmitriGekhtman, I am not sure whether the Pod status change is enough or not. For example, if a ClusterIP service for a head Pod is deleted accidentally and the head Pod's status does not change, the service may not be created again.
I guess with the 2 or 30-60 second automatic reconcile, the service would be created again -- does this resolve the concern? So we would have both this automatic reconciling as well as the reconciling upon Pod status change.
Right, I'd recommend both fixing the pod status trigger and requeueing a minute after a successful reconcile.
Sorry, "Requesting changes" again to indicate my intent -- @Jeffwan had a good point that unconditionally triggering reconcile every two seconds is overkill.
Instead we should:
- Unconditionally trigger reconcile once a minute. (Modify the 2 seconds in the currently altered line to 60 seconds.)
- Make sure pod status changes trigger reconcile, by fixing the event filter (a sketch of one possible fix follows below).
Once this is fixed, we could possibly consider a patch 0.4.1 release to fix Ray Jobs.
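For illustration, a hedged sketch of the kind of event-filter fix meant here. This is not the actual KubeRay code: the Ray API import path and the `SetupWithManager` wiring are assumptions, and `GenerationChangedPredicate` (a stock controller-runtime helper) drops updates that don't bump `.metadata.generation` -- which covers status-only updates, though it would also drop label/annotation-only changes -- while the pod watch stays unfiltered:

```go
package ray

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Assumed import path for the RayCluster API types.
	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// Stand-in for the real reconciler, to keep the sketch self-contained.
type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil // placeholder
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Filter only the RayCluster watch: status-only updates (e.g.
		// LastUpdateTime bumps) no longer trigger reconcile.
		For(&rayiov1alpha1.RayCluster{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		// Pod and Service watches stay unfiltered, so a pod transition
		// such as Pending -> Running still triggers reconcile.
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
```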
Signed-off-by: Archit Kulkarni <[email protected]>
Manually tested, so @DmitriGekhtman I think this is ready to be merged. I think it makes sense to fix the filter issue #872 in a second PR instead of this PR, because they're conceptually separate changes, but I don't feel strongly either way.
Makes sense to leave the filter fix for later. For posterity, could you update the PR title and description to explain the changes?
@DmitriGekhtman Ah, thanks for the reminder -- should be good to go now.
For customers that are using KubeRay and managing hundreds or thousands of clusters, a 60s reconciliation loop would have a big impact on the operator itself and on the traffic to the k8s API server. I think this should be a configurable attribute; we can default it to 5 minutes. For example, we can pass an env var to the operator that specifies a different frequency instead of hardcoding it.
@akanso That makes sense to me, so something like this? (cc @DmitriGekhtman)

```go
// Requires the "os", "strconv", and "time" imports.
var requeueAfterSeconds int
requeueAfterSeconds, err := strconv.Atoi(os.Getenv("RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S"))
if err != nil {
	requeueAfterSeconds = 5 * 60 // default to 5 minutes
}
return ctrl.Result{RequeueAfter: time.Duration(requeueAfterSeconds) * time.Second}, nil
```

And how would an environment variable typically be passed to the operator?
We can simply pass it in the k8s YAML manifest of the operator pod. If it is missing, we default to 5 minutes.
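(For illustration, a minimal sketch of what that could look like in the operator's Deployment manifest; the container name and image are assumptions, and the env var name mirrors the snippet above:)

```yaml
# Hypothetical excerpt of the KubeRay operator Deployment.
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator   # assumed container name
          image: kuberay/operator  # assumed image
          env:
            - name: RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S
              value: "300"         # reconcile every 5 minutes
```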
5 minute default sounds good |
Doesn't reconcile trigger on any pod spec change, though? |
Signed-off-by: Archit Kulkarni <[email protected]>
Successfully tested manually with the latest change.
It does. But imagine KubeRay is down or restarting: then it can miss an event, such as the creation/modification/deletion of a RayCluster. This reconcile logic will make sure missed events are processed. The events already processed will be ignored, since the desired state is already satisfied.
I meant that choosing 5 min over 1 min for performance reasons in a large cluster might not make sense, given that a large cluster has many pod changes happening -- these pod changes trigger many reconciles anyway.
@DmitriGekhtman "Go-build-and-test / Compatibility Test - Nightly (pull_request)" is failing, but I think this is known to be flaky. Maybe we can bypass the test and merge this PR anyway? |
Failing test is known to be flakey. |
…ery 60s instead of only upon change (ray-project#850)
Have the RayCluster controller reconcile once per 5 minutes.
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Why are these changes needed?
This PR fixes an issue where RayJobs would never be submitted even if the Ray Cluster was up and running. Debugged this with @brucez-anyscale .
The root cause is the following:
The cluster status is only marked "Ready" if all pods are RUNNING. The cluster state only gets updated when a reconcile happens.
However, the Cluster reconcile doesn't get requeued if all the pods are in [RUNNING, PENDING] state.
In particular, when the head pod becomes PENDING, the cluster never gets reconciled again.
So the cluster is stuck in a state where its status is not "Ready". Since the RayJob reconciler waits for the cluster status to be "Ready" before submission, the job hangs and is never submitted.
The fix in this PR is to:

- ~~requeue the cluster reconcile in the case where the cluster pods are not all running yet; this introduced a new cluster state "Pending" for this case~~ (superseded by the revert above)
- always requeue the cluster reconciliation (every 300s, for now).

As for why the head node state change PENDING -> RUNNING didn't trigger a cluster reconcile before, we think it might be due to an event filter (kuberay/ray-operator/controllers/ray/raycluster_controller.go, line 803 at 633ff63) which only does the reconcile when there's a config, label, or annotation change. However, we can't remove this filter, because it's necessary to prevent reconciling in a tight loop: `instance.Status.LastUpdateTime` is updated at each reconcile.

Conclusion after discussion on this PR: fixing the event filters is a separate but related bug and will be addressed in a follow-up PR; the issue will be tracked here: #872. Once that is fixed, we could possibly consider a patch 0.4.1 release. After this PR, there may be an up-to-60s unnecessary delay between when the cluster is ready and when the Job is submitted, which is not ideal.
Related issue number
Closes #845
Checks